# Install and load packages 
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
devtools::install_github("ebenmichael/augsynth")
## Using GitHub PAT from the git credential store.
## Skipping install of 'augsynth' from a github remote, the SHA1 (982f650b) has not changed since last install.
##   Use `force = TRUE` to force installation
pacman::p_load(# Tidyverse packages including dplyr and ggplot2 
               tidyverse,
               ggthemes,
               augsynth,
               gsynth)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
# set seed
set.seed(44)

# load data
medicaid_expansion <- read_csv("data/medicaid_expansion.csv")
## Rows: 663 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): State
## dbl  (3): year, uninsured_rate, population
## date (1): Date_Adopted
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(medicaid_expansion)
## [1] "State"          "Date_Adopted"   "year"           "uninsured_rate"
## [5] "population"

Introduction

For this project, you will explore the question of whether the Affordable Care Act increased health insurance coverage (or conversely, decreased the number of people who are uninsured). The ACA was passed in March 2010, but several of its provisions were phased in over a few years. The ACA instituted the “individual mandate” which required that all Americans must carry health insurance, or else suffer a tax penalty. There are four mechanisms for how the ACA aims to reduce the uninsured population:

In 2012, the Supreme Court heard the landmark case NFIB v. Sebelius, which principally challenged the constitutionality of the law under the theory that Congress could not institute an individual mandate. The Supreme Court ultimately upheld the individual mandate under Congress’s taxation power, but struck down the requirement that states must expand Medicaid as impermissible subordination of the states to the federal government. Subsequently, several states refused to expand Medicaid when the program began on January 1, 2014. This refusal created the “Medicaid coverage gap” where there are indivudals who earn too much to qualify for Medicaid under the old standards, but too little to qualify for the ACA subsidies targeted at middle-income individuals.

States that refused to expand Medicaid principally cited the cost as the primary factor. Critics pointed out however, that the decision not to expand primarily broke down along partisan lines. In the years since the initial expansion, several states have opted into the program, either because of a change in the governing party, or because voters directly approved expansion via a ballot initiative.

You will explore the question of whether Medicaid expansion reduced the uninsured population in the U.S. in the 7 years since it went into effect. To address this question, you will use difference-in-differences estimation, and synthetic control.

Data

The dataset you will work with has been assembled from a few different sources about Medicaid. The key variables are:

Exploratory Data Analysis

Create plots and provide 1-2 sentence analyses to answer the following questions:

# Highest and lowest uninsured rates
#Find years in dataset
unique(medicaid_expansion$year)
##  [1] 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
# Filter data for all years prior to 2014
pre2014_data <- medicaid_expansion %>%
  filter(year < 2014)

# Uninsured rates prior to 2014 (Going to take the mean of all years prior, not taking it year by year, will update if this is a mistake)
pre2014_uninsuredrate <- pre2014_data %>%
  group_by(State) %>%
  summarize(mean_uninsured_rate = mean(uninsured_rate)) %>%
  arrange(desc(mean_uninsured_rate)) %>%
  mutate(State = factor(State, levels = State)) 

#Find top and bottom quartiles (calling this the highest and lowest, since I don't want to plot all 50 states)
# Calculate the 25th and 75th percentiles (Quartiles)
q1 <- quantile(pre2014_uninsuredrate$mean_uninsured_rate, 0.25)
q3 <- quantile(pre2014_uninsuredrate$mean_uninsured_rate, 0.75)

# Filter states in the bottom and top quartiles based on mean uninsured rate
bottom_quartile2014 <- pre2014_uninsuredrate %>%
  filter(mean_uninsured_rate <= q1) %>%
  mutate(Quartile = 'Bottom Quartile')

top_quartile2014 <- pre2014_uninsuredrate %>%
  filter(mean_uninsured_rate >= q3) %>%
  mutate(Quartile = 'Top Quartile')

# Combine both quartiles into a single data frame
combined_quartiles <- bind_rows(bottom_quartile2014, top_quartile2014)

# Create a plot for both top and bottom quartiles
ggplot(combined_quartiles, aes(x = reorder(State, mean_uninsured_rate), y = mean_uninsured_rate, fill = Quartile)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  labs(title = 'States with the Highest and Lowest Uninsured Rates Prior to 2014', 
       x = 'State', 
       y = 'Mean Uninsured Rate (%)') +
  scale_fill_manual(values = c('Bottom Quartile' = 'lightgreen', 'Top Quartile' = 'lightcoral')) +
  theme_minimal()

I decided to plot uninsured population rates pre-2014 as an average of those rates over the pre-2014 years in the study period (2008-2013). I understand that I lose some naunce here because rates could be way higher/lower at some point in this pre-2014 period leading to a deceiving average, whereas looking at this yearly would allow me to see those trends over time pre-2014, but for the sake of exploratory analyses, I like this visual. I will return to this if it seems like later on I should be looking year to year. It looks like the states with the lowest rates of uninsured individuals pre-2014 are mostly northern/midwestern states, and the highest rates of uninsured individuals are southern/western states. P.S., it’s noted that D.C. is missing population data. If that becomes an issue, I’ll take it out of the dataset.

library(dplyr)
# most uninsured Americans
# Calculate total uninsured population for each state prior to 2014 
pre2014_uninsuredrate <- pre2014_data %>%
  group_by(State) %>%
  summarize(
    mean_uninsured_rate = mean(uninsured_rate),
    population = first(population)  #Since there's only one unique value for each state (because it's based on 2010 population), just choosing the first
  ) %>%
  mutate(
    total_uninsured_population = population * mean_uninsured_rate
  ) %>%
  arrange(desc(total_uninsured_population)) %>%
  mutate(State = factor(State, levels = State))

#Caluclate total uninsured population for each state in 2020
uninsured_2020 <- medicaid_expansion %>%
  filter(year == 2020) %>%
  group_by(State) %>%
  summarize(
    uninsured_rate_2020 = first(uninsured_rate),  # Use the uninsured rate directly for 2020
    population = first(population)  # Use the population for the state
  ) %>%
  mutate(
    total_uninsured_population_2020 = population * uninsured_rate_2020
  ) %>%
  arrange(desc(total_uninsured_population_2020))
glimpse(pre2014_uninsuredrate)
## Rows: 51
## Columns: 4
## $ State                      <fct> California, Texas, Florida, New York, Georg…
## $ mean_uninsured_rate        <dbl> 0.1796478, 0.1992837, 0.2068583, 0.1145605,…
## $ population                 <dbl> 38802500, 26956958, 19893297, 19746227, 100…
## $ total_uninsured_population <dbl> 6970785.1, 5372081.4, 4115094.3, 2262137.6,…
glimpse(uninsured_2020)
## Rows: 51
## Columns: 4
## $ State                           <chr> "Texas", "California", "Florida", "Geo…
## $ uninsured_rate_2020             <dbl> 0.184, 0.077, 0.132, 0.134, 0.113, 0.0…
## $ population                      <dbl> 26956958, 38802500, 19893297, 10097343…
## $ total_uninsured_population_2020 <dbl> 4960080.3, 2987792.5, 2625915.2, 13530…
# Filter states in the top quartiles pre-2014 and in 2020 based on total uninsured population rates
library(ggplot2)
library(scales)  # for comma format
library (tidyr)
#Define cutoffs
q3_2014 <- quantile(pre2014_uninsuredrate$total_uninsured_population, 0.75, na.rm = TRUE)
q3_2020 <- quantile(uninsured_2020$total_uninsured_population_2020, 0.75, na.rm = TRUE)

#Filter states in the top quartile for both time periods 
pre2014_top <- pre2014_uninsuredrate %>%
  mutate(Year = "Pre-2014") %>%
  mutate(in_top_quartile_2014 = total_uninsured_population >= q3_2014) %>%
  select(State, total_uninsured_population_2014 = total_uninsured_population, in_top_quartile_2014)

post2020_top <- uninsured_2020 %>%
  mutate(Year = "2020") %>%
  mutate(in_top_quartile_2020 = total_uninsured_population_2020 >= q3_2020) %>%
  select(State, total_uninsured_population_2020, in_top_quartile_2020)

# Full join on State to preserve all states that were in either top quartile
combined_all_states <- full_join(pre2014_top, post2020_top, by = "State")

# Filter out states that weren't in the top quartile for either time period
combined_all_states <- combined_all_states %>%
  filter(in_top_quartile_2014 == TRUE | in_top_quartile_2020 == TRUE)

# Reformat data for plotting 
plot_data <- combined_all_states %>%
  pivot_longer(cols = starts_with("total_uninsured_population"),
               names_to = "Year",
               values_to = "total_uninsured_population") %>%
  mutate(Year = recode(Year,
                       "total_uninsured_population_2014" = "Pre-2014",
                       "total_uninsured_population_2020" = "2020"))

# Create a plot for both top quartiles, adding the data for a state if it was in the top quartile for either time period 
ggplot(plot_data, aes(x = reorder(State, -total_uninsured_population, na.rm = TRUE),
                      y = total_uninsured_population,
                      fill = Year)) +
  geom_bar(stat = "identity", position = "dodge", na.rm = TRUE) +
  labs(title = "Top Quartile States (Pre2014 and/or 2020) by Total Uninsured Population",
       x = "State",
       y = "Total Uninsured Population") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

It looks like the total population of uninsured people went down from pre-2014 levels for all states in 2020. California seems to have a huge difference, Florida and New York too, and then Texas looks like it didn’t change that much. It’s important to note that the graph includes states that were in the top quartile of total uninsured individuals for either pre-2014 or in 2020, so some of these states might not have been in the top quartile for both time periods.

Difference-in-Differences Estimation

Estimate Model

Do the following:

  • Choose a state that adopted the Medicaid expansion on January 1, 2014 and a state that did not. Hint: Do not pick Massachusetts as it passed a universal healthcare law in 2006, and also avoid picking a state that adopted the Medicaid expansion between 2014 and 2015.
  • Assess the parallel trends assumption for your choices using a plot. If you are not satisfied that the assumption has been met, pick another state and try again (but detail the states you tried).
#Find out when states adopted medicaid expansion, create a column indicating whether or not state ever adopted  
medicaid_adoption <- medicaid_expansion %>%
  distinct(State, Date_Adopted) %>%
  mutate(
    Adopted = ifelse(is.na(Date_Adopted), "Never Expanded", "Expanded")
  ) %>%
  arrange(Date_Adopted)

#Merge with dataset
medicaid_expansion$Date_Adopted <- as.Date(medicaid_expansion$Date_Adopted)
medicaid_adoption$Date_Adopted <- as.Date(medicaid_adoption$Date_Adopted)
medicaid_expansion_full <- medicaid_expansion %>%
  left_join(medicaid_adoption %>% select(State, Date_Adopted, Adopted), by = "State") %>%
  select(-Date_Adopted.x) %>%
  rename(Date_Adopted = Date_Adopted.y)
colnames(medicaid_expansion_full)
## [1] "State"          "year"           "uninsured_rate" "population"    
## [5] "Date_Adopted"   "Adopted"
#When it says "choose a state that did not" I'm assuming this means choose a state that never adopted the expansion, not a state that adopted the expansion just not between 2014-205 (e.g. 2020)
#Find states that adopted expansion on 1/1/2014 and those who didn't at all, create a new dataframe to work on 
compare_states <- medicaid_expansion_full %>%
  filter(Date_Adopted == as.Date("2014-01-01") | Adopted == "Never Expanded")
compare_states <- compare_states %>%
  left_join(pre2014_uninsuredrate %>%
              select(State, total_uninsured_population, mean_uninsured_rate),
            by = "State")

I had such a hard time trying to figure out how to show parallel trends lol. As I’ve griped about in class, I feel like parallel trends are pretty unconvincing, it’s pretty easy for me to think something fails the eye test unless the lines are, like, really similar.

#Parallel trends plot
parallel_data <- compare_states %>%
  filter(State %in% c("California", "Georgia")) %>%
  select(State, year, uninsured_rate)

parallel_data <- parallel_data %>%
  mutate(year = as.integer(year))

ggplot(parallel_data, aes(x = year, y = uninsured_rate, color = State)) +
  geom_line(size = 1.2) +
  geom_point() +
  scale_x_continuous(breaks = seq(min(parallel_data$year), 2014, 1),
                     labels = as.character(seq(min(parallel_data$year), 2014, 1))) +
  coord_cartesian(xlim = c(min(parallel_data$year), 2014)) +  # truncate the plot at 2014
  labs(title = "Pre-2014 Uninsured Rate Trends: California vs. Georgia",
       x = "Year",
       y = "Uninsured Rate",
       color = "State") +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

#Find two states, one that never adopted the expansion and one that adopted 1/1/2014 that are similar in terms of uninsured_rates, populations, and total_uninsured populations prior to 2014
compare_states %>%
  filter(State %in% c("California", "Georgia")) %>%
  distinct(State, population, Date_Adopted, mean_uninsured_rate, total_uninsured_population) %>%
  print()
## # A tibble: 2 × 5
##   State      population Date_Adopted mean_uninsured_rate total_uninsured_popul…¹
##   <chr>           <dbl> <date>                     <dbl>                   <dbl>
## 1 California   38802500 2014-01-01                 0.180                6970785.
## 2 Georgia      10097343 NA                         0.192                1935272.
## # ℹ abbreviated name: ¹​total_uninsured_population

This was the best I could do… the parallel trends look visually the best out of any pair I tested (I spent like an hour and a half trying), but I know obviously that there are reasons to believe that Georgia and California are very different. That being said, when weighing the visual parallel trends vs. our real world knowledge of these policy contexts, which is important? For example, I kept trying jurisdictions that seemed similar enough (Kentucky vs. Alabama) but their parallel trends visually just couldn’t be justified to me. Not sure if this was the better move or if I should’ve picked a “worse” trend visually that made more sense in the real world.

  • Estimates a difference-in-differences estimate of the effect of the Medicaid expansion on the uninsured share of the population. You may follow the lab example where we estimate the differences in one pre-treatment and one post-treatment period, or take an average of the pre-treatment and post-treatment outcomes
# Difference-in-Differences estimation
#Create dataframe
cg <- compare_states %>%
  filter(State %in% c("California", "Georgia")) %>%
  filter(year %in% c(2013, 2014))

#Pre-treatment difference
pre_diff <- cg %>%
  filter(year == 2013) %>%
  select(State, uninsured_rate) %>%
  pivot_wider(names_from = State, values_from = uninsured_rate) %>%
  summarise(diff = California - Georgia)
#post-treatment diference
post_diff <- cg %>%
  filter(year == 2014) %>%
  select(State, uninsured_rate) %>%
  pivot_wider(names_from = State, values_from = uninsured_rate) %>%
  summarise(diff = California - Georgia)

#DiD
diff_in_diffs <- post_diff - pre_diff
diff_in_diffs
##       diff
## 1 -0.01648

I think this means that California’s uninsured rate went down bt -0.01648 from the pre-treatment year (2013) to the post-treatment year (2014). I will come back and do this with averages over pre and post treatment years if I have time…

Discussion Questions

  • Card/Krueger’s original piece utilized the fact that towns on either side of the Delaware river are likely to be quite similar to one another in terms of demographics, economics, etc. Why is that intuition harder to replicate with this data? Like discussed above, the trade-off between choosing two states whose parallel trend lines are more convincing and ones whose policy context seem similar is actually more difficult than one might imagine; as we saw above, places with seemingly similar policy contexts (neighboring states with similar populations and mean rates of uninsured population) often had really wonky parallel trends that were often thrown off by just one year. Maybe it’s not even fair to assume that geographically neighboring or similar (e.g. comparing two midwestern states) states would have similar policy contexts, but it definitely seems more intuitive that Georgia is more similar to Florida than it is to California. But I think that would be the intuition of Card/Krueger that can’t be applied here; two neighboring towns are probably more likely to be similar than two neighboring states, but even that feels like a stretch. Two towns that are split by some geographical feature might have differences - see the construction of highways in Oakland.

  • What are the strengths and weaknesses of using the parallel trends assumption in difference-in-differences estimates? I think the strength is how powerful the visual aspect of parallel trends is. Even though it is subjective, if you see two lines that basically have the same exact slope,it’s hard to argue with that. My California/Georgia example may not be the best, but if it were a little better visually, the existence of the parallel trends visually will kind of rebut claims that these places are too different to compare. Meaning, if the outcome of interest is trending in the same exact ways in these two places, it seems likely that no matter how different they are, covariates are seemingly impacting them in the same way. I think the weakness for me, as mentioned, is also the visual aspect and the fact that it is not really a mathematical formulation of “parallelness,” like there’s no universal standard, it’s just sort of a visual thing that your reviewers may or may not be convinced by. I guess the upshot of that is anyone who is doing serious diff-in-diff work probably is really good at defending their analytical choices to stave off criticisms for those who are typically unconvinced by parallel trends!

Synthetic Control

Estimate Synthetic Control

Although several states did not expand Medicaid on January 1, 2014, many did later on. In some cases, a Democratic governor was elected and pushed for a state budget that included the Medicaid expansion, whereas in others voters approved expansion via a ballot initiative. The 2018 election was a watershed moment where several Republican-leaning states elected Democratic governors and approved Medicaid expansion. In cases with a ballot initiative, the state legislature and governor still must implement the results via legislation. For instance, Idaho voters approved a Medicaid expansion in the 2018 election, but it was not implemented in the state budget until late 2019, with enrollment beginning in 2020.

Do the following:

# non-augmented synthetic control
#Find delayed adopters
adopted_post2014 <- medicaid_expansion_full %>%
  filter(Date_Adopted > as.Date("2014-12-31")) %>%
  distinct(State, Date_Adopted)

view(adopted_post2014)
#I'm not sure if I should be using some theory-informed state for the synthetic control, but my gut is telling me to pick a state that adopted on New Years day of some year. Earlier in the notebook it said don't choose 2014 or 2015 for DiD, so I won't choose either. Therefore, I'll choose 01-01-2016 as my date, which hopefully will still give me enough data post-treatment.
treated_state <- "Montana"
treatment_year <- medicaid_expansion_full %>%
  filter(State == treated_state) %>%
  pull(Date_Adopted) %>%
  unique() %>%
  as.Date() %>%
  format("%Y") %>%
  as.numeric()
# Clean data
medicaid_clean <- medicaid_expansion_full %>%
  mutate(
    treatment = ifelse(State == treated_state & year >= treatment_year, 1, 0)
  )

# Run non-augmented synthetic control
syn <- augsynth(
  uninsured_rate ~ treatment,  # Outcome variable and treatment indicator
  unit = State,                # Unit (states)
  time = year,                 # Time variable (year)
  data = medicaid_clean,       # Data frame
  progfunc = "none",           # No augmentation
  scm = TRUE                   # Enable synthetic control
)
## One outcome and one treatment time found. Running single_augsynth.
#Save ATT and L2 imbalance
# Extract the ATT and L2 Imbalance from the summary text (I was having issues with getting the ATT out of the summary)
summary_text <- capture.output(summary(syn))
att_line <- summary_text[grep("Average ATT Estimate", summary_text)]
l2_line <- summary_text[grep("L2 Imbalance", summary_text)]

# Extract the numeric values for ATT and L2 Imbalance
non_aug_att <- as.numeric(str_extract(att_line, "[-]?\\d+\\.\\d+"))
non_aug_l2 <- as.numeric(str_extract(l2_line, "[-]?\\d+\\.\\d+"))

# Print out the results
cat("Non-Augmented ATT:", non_aug_att, "\n")
## Non-Augmented ATT: -0.0189
cat("Non-Augmented L2 Imbalance:", non_aug_l2, "\n")
## Non-Augmented L2 Imbalance: 0.009
# plot
plot(syn)

weights_df <- data.frame(syn$weights) %>%
  tibble::rownames_to_column('State') %>%
  rename(syn_weight = syn.weights)

# Plot weights
library(ggplot2)
library(ggthemes)

ggplot(weights_df) +
  geom_col(aes(x = State, y = syn_weight), fill = "steelblue") +
  coord_flip() +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) +
  labs(
    title = "Synthetic Control Weights",
    x = "State",
    y = "Weight"
  )

Okay, only a few states are donors (is that the right terminology?) for our synthetic control.

# view each state's contribution, where weights are greater than 0
# ---------
data.frame(syn$weights) %>%
  # processing
  # ---------
  tibble::rownames_to_column('State') %>%
  filter(syn.weights > 0) %>% # filter out weights less than 0
  # plot
  # ---------
  ggplot() +
  geom_bar(aes(x = State, 
               y = syn.weights),
           stat = 'identity') +
  coord_flip() +   # flip to make it more readable
  # themes
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) +
  # labels
  ggtitle('Synthetic Control Weights') +
  xlab('State') +
  ylab('Weight')

#Use tidysynth to visualize 
library(tidysynth)

medicaid_out <- medicaid_clean %>%
  synthetic_control(
    outcome = uninsured_rate,
    unit = State,
    time = year,
    i_unit = "Montana",      # Treated unit
    i_time = 2016,           # Intervention time
    generate_placebos = TRUE
  ) %>%
  # Use a different approach to generate predictors
  generate_predictor(
    time_window = 2008:2013,
    avg_uninsured = mean(uninsured_rate, na.rm = TRUE)  # Use a summary statistic instead
  ) %>%
  generate_weights(optimization_window = 2008:2015) %>%
  generate_control()
#Plot trends for Montana vs synthetic Montana
medicaid_out %>% plot_trends()

#Plot the differences (gap between actual and synthetic Montana)
medicaid_out %>% plot_differences()

#Plot placebos (other donor states) vs observed Montana
medicaid_out %>% plot_placebos()

#Plot control weights of each other state
medicaid_out %>% plot_weights()

…uhhh why is Nevada the only donor state basically… also why is it the avg_unisnured

# augmented synthetic control
aug_syn <- augsynth(
  uninsured_rate ~ treatment, 
  unit = State, 
  time = year, 
  data = medicaid_clean,
  progfunc = "ridge", # Using ridge regression for augmentation
  scm = TRUE
)
## One outcome and one treatment time found. Running single_augsynth.
# Extract the ATT and L2 Imbalance using the same approach as before (same issue getting ATT)
summary_aug_text <- capture.output(summary(aug_syn))
aug_att_line <- summary_aug_text[grep("Average ATT Estimate", summary_aug_text)]
aug_l2_line <- summary_aug_text[grep("L2 Imbalance", summary_aug_text)]

# Extract the numeric values
aug_att <- as.numeric(str_extract(aug_att_line, "[-]?\\d+\\.\\d+"))
aug_l2 <- as.numeric(str_extract(aug_l2_line, "[-]?\\d+\\.\\d+"))

# Print out the results
cat("Augmented ATT:", aug_att, "\n")
## Augmented ATT: -0.0189
cat("Augmented L2 Imbalance:", aug_l2, "\n")
## Augmented L2 Imbalance: 0.009
cat("Non-Augmented L2 Imbalance (for comparison):", non_aug_l2, "\n")
## Non-Augmented L2 Imbalance (for comparison): 0.009
plot(aug_syn)

# barplots of weights
weights_plot <- data.frame(aug_syn$weights) %>% 
  tibble::rownames_to_column('State') %>% 
  filter(aug_syn.weights > 0) %>% 
  ggplot() + 
  geom_bar(aes(x = reorder(State, aug_syn.weights), y = aug_syn.weights), 
           stat = 'identity', fill = "steelblue") + 
  coord_flip() + 
  theme_minimal() + 
  labs(
    title = "Augmented Synthetic Control Weights", 
    x = "State", 
    y = "Weight"
  )

print(weights_plot)

…why are none of the weights negative

# Create a comparison plot between actual, non-aug synthetic, and ridge Montana
#Kasey I tried so hard to put all three on the same graph using the 6-6 notebook and it just would not work I'm sorry... so here's ridge vs  observed
# Create a tidysynth object for the ridge-augmented version
medicaid_out_ridge <- medicaid_clean %>%
  synthetic_control(
    outcome = uninsured_rate,
    unit = State,
    time = year,
    i_unit = "Montana",
    i_time = 2016,
    generate_placebos = TRUE
  ) %>%
  # Use ridge regression for augmentation
  generate_predictor(
    time_window = 2008:2013,
    avg_uninsured = mean(uninsured_rate, na.rm = TRUE)
  ) %>%
  # In tidysynth, ridge augmentation can be specified in the weights function
  generate_weights(
    optimization_window = 2008:2015,
    ridge_penalty = 0.1  # Add a ridge penalty
  ) %>%
  generate_control()

# Plot the ridge-augmented results
medicaid_out_ridge %>% plot_trends()

medicaid_out_ridge %>% plot_differences()

# Create a dataset specifically comparing the two synthetic controls, ridge and non-aug
# ---------
# Get summaries from both models
syn_sum <- summary(syn)
aug_syn_sum <- summary(aug_syn)

# Create Montana with synthetic controls - IMPORTANT: Subtract ATT not add it
montana_synmontana <- 
  # data
  medicaid_clean %>%   
  # filter just Montana 
  filter(State == "Montana") %>% 
  # Create synthetic Montana values by SUBTRACTING the ATT 
  # (since negative ATT means uninsured rate decreased)
  mutate(
    synthetic_montana = uninsured_rate + non_aug_att,
    aug_synthetic_montana = uninsured_rate + aug_att
  )

# Plot all three lines
montana_synmontana %>%
  ggplot() +
  # Montana actual
  geom_line(aes(x = year, 
                y = uninsured_rate, 
                color = 'Montana')) +
  # Synthetic Montana (non-augmented)
  geom_line(aes(x = year, 
                y = synthetic_montana, 
                color = 'Synthetic Montana')) +
  # Augmented Synthetic Montana
  geom_line(aes(x = year, 
                y = aug_synthetic_montana, 
                color = 'Augmented Synthetic Montana')) +
  # Use scale color manual to assign color values 
  scale_color_manual(values = c('Montana' = 'red', 
                               'Synthetic Montana' = 'blue',
                               'Augmented Synthetic Montana' = 'green')) +
  geom_vline(aes(xintercept = 2016)) +
  # themes
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) +
  # labels 
  ggtitle('Montana, Synthetic Montana, and Augmented Synthetic Montana') +
  xlab('Year') +
  ylab('Uninsured Rate')

Kasey I tried so hard to get synthetic Monatana to show up but I just could not…. HINT: Is there any preprocessing you need to do before you allow the program to automatically find weights for donor states?

Discussion Questions

  • What are the advantages and disadvantages of synthetic control compared to difference-in-differences estimators?

  • Answer:

  • One of the benefits of synthetic control is that the weights are bounded between [0,1] and the weights must sum to 1. Augmentation might relax this assumption by allowing for negative weights. Does this create an interpretation problem, and how should we balance this consideration against the improvements augmentation offers in terms of imbalance in the pre-treatment period?

  • Answer:

Staggered Adoption Synthetic Control

Estimate Multisynth

Do the following:

  • Estimate a multisynth model that treats each state individually. Choose a fraction of states that you can fit on a plot and examine their treatment effects.
# multisynth model states
# Prepare data for multisynth 
medicaid_multi_ready <- medicaid_expansion_full %>%
  # Create binary treatment indicator based on adoption year
  mutate(
    cbr = ifelse(!is.na(Date_Adopted) & year >= as.numeric(format(Date_Adopted, "%Y")), 1, 0)
  )

# Run multisynth with default nu
# ---------
ppool_syn <- multisynth(uninsured_rate ~ cbr, 
                        State,                        # unit
                        year,                         # time
                        medicaid_multi_ready,         # data 
                        n_leads = 6)                  # post-treatment periods to estimate

# View results 
print(ppool_syn$nu)
## [1] 0.2933713
# Get summary stats
# ---------
ppool_syn_summ <- summary(ppool_syn)

# Plot all states (this might be crowded)
# ---------
ppool_syn_summ$att %>%
  ggplot(aes(x = Time, y = Estimate, color = Level)) +
  geom_point() +
  geom_line() +
  geom_vline(xintercept = 0) +
  theme_minimal() +
  theme(axis.title = element_text(),
        legend.position = "bottom") +
  ggtitle('Synthetic Controls for Medicaid Expansion') +
  xlab('Time') +
  ylab('Uninsured Rate Estimate')
## Warning: Removed 244 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 244 rows containing missing values or values outside the scale range
## (`geom_line()`).

# Plot with facets for better readability
# ---------
ppool_syn_summ$att %>%
  ggplot(aes(x = Time, y = Estimate, color = Level)) +
  geom_point() +
  geom_line() +
  geom_vline(xintercept = 0) +
  theme_minimal() +
  theme(axis.title = element_text(),
        legend.position = 'None') +
  ggtitle('Synthetic Controls for Medicaid Expansion') +
  xlab('Time') +
  ylab('Uninsured Rate Effect') +
  facet_wrap(~Level)   # facet-wrap by level (state in this case) for clearer presentation
## Warning: Removed 244 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 244 rows containing missing values or values outside the scale range
## (`geom_line()`).

  • Estimate a multisynth model using time cohorts. For the purpose of this exercise, you can simplify the treatment time so that states that adopted Medicaid expansion within the same year (i.e. all states that adopted epxansion in 2016) count for the same cohort. Plot the treatment effects for these time cohorts.
# multisynth model time cohorts
# Run multisynth by time cohorts
# ---------
ppool_syn_time <- multisynth(uninsured_rate ~ cbr, 
                             State,
                             year,
                             medicaid_multi_ready,
                             n_leads = 6,
                             time_cohort = TRUE)

# Get summary stats
# ---------
ppool_syn_time_summ <- summary(ppool_syn_time)

# Plot cohort effects
# ---------
ppool_syn_time_summ$att %>%
  ggplot(aes(x = Time, y = Estimate, color = Level)) +
  geom_point() +
  geom_line() +
  geom_vline(xintercept = 0) +
  theme_minimal() +
  theme(axis.title = element_text(),
        legend.position = "bottom") +
  ggtitle('Synthetic Controls for Medicaid Expansion by Adoption Cohort') +
  xlab('Time') +
  ylab('Uninsured Rate Effect')
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_line()`).

ppool_syn_time_summ$att %>%
  ggplot(aes(x = Time, y = Estimate, color = Level)) +
  geom_point() +
  geom_line() +
  geom_vline(xintercept = 0) +
  theme_fivethirtyeight() +  # Make sure this theme is loaded
  theme(axis.title = element_text(),
        legend.position = 'None') +
  ggtitle('Synthetic Controls for Medicaid Expansion') +
  xlab('Time') +
  ylab('Uninsured Rate Effect') +
  facet_wrap(~Level)
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_line()`).

Discussion Questions

  • One feature of Medicaid is that it is jointly administered by the federal government and the states, and states have some flexibility in how they implement Medicaid. For example, during the Trump administration, several states applied for waivers where they could add work requirements to the eligibility standards (i.e. an individual needed to work for 80 hours/month to qualify for Medicaid). Given these differences, do you see evidence for the idea that different states had different treatment effect sizes?

  • Answer:

  • Do you see evidence for the idea that early adopters of Medicaid expansion enjoyed a larger decrease in the uninsured population?

  • Answer:

General Discussion Questions